Class 1 Objectives

  • Create a PDF document and a website to communicate your own analysis
  • Including your own text, analysis, table and chart
  • Using data from an external source (a CSV file)
  • Understand the data science workflow in R
  • Gain confidence in R

Why R?

  • Designed for data science
  • A lingua franca of social science
  • Easy to make professional outputs (tables, charts, maps) to promote FGV/CEPESP
  • ANY question you have has already been answered online

Organizing your Analysis:

  1. Use RStudio for everything. R is the engine; RStudio is the interface. If you do half your data cleaning in Excel, there will be no record of it and we won't be able to fix mistakes.

  2. Our work is in the form of a 'Recipe Book': A step-by-step guide to the data inputs (the ingredients), our analysis (the cooking instructions) and the outputs (the picture of the perfect meal).

  3. Use R Markdown (.Rmd) files to combine analysis and outputs: R Markdown files allow us to combine data processing in R, in clearly defined 'chunks', with the display of text, tables, charts and maps.

Organizing your Analysis:

  1. Output is produced only when we press 'Knit': It is NOT an interactive playground like Excel (though we can run individual lines interactively with Ctrl+Enter).

  2. Our work should be self-explanatory and reproducible: Anybody with R should be able to open our work, press 'Knit' and produce the same outputs.

  3. Organize your work in Projects in R: For each major analysis, it's best to choose 'File' -> 'New Project' -> 'New Directory' from RStudio. Save all your data inputs and outputs in this folder (RStudio sets it as the working directory automatically).

Organizing your Analysis:

  1. Data frames (tables) are the main building block of our analysis: We focus on manipulating and visualizing tables of data, as these are the best way of organizing our data.

  2. Use meaningful names in your work: 'data_v1b_2_061215' won't mean anything in 3 months! All files and objects should reflect their role in the analysis.

  3. Process our data in a 'tidy' way: This means we will use a set of compatible 'packages' called the 'tidyverse' to make our analysis transparent and avoid common problems.

Basic Tools in RStudio

  • Text vs. Code Chunks
    • Type text directly
    • Insert -> 'R' creates a code chunk
    • Chunks contain data processing and outputs (tables, charts etc.)
    • Use a separate chunk for each output

Basic Tools in RStudio

  • Formatting Text
    • Cheat Sheet
    • Equations: In LaTeX format, e.g. $$ a^2 + b^2 = c^2 $$

\[ a^2 + b^2 = c^2 \]

Basic Tools in RStudio

  • Data analysis within code chunks:
    • Assigning to saved objects: new_object <- old_object
    • Inspecting objects interactively: Type their name and press 'Ctrl+Enter'
    • Processing objects: data_frame %>% action_on_dataframe
    • Comments: # Comments go here and won't be processed by R
    • The actions (functions) we can use depend on the packages we have loaded:
      • install.packages("New_package") ONCE, then
      • library("New_package") at the start of each document
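
A minimal sketch of the install-once, load-every-time pattern, using dplyr as the example package. The `requireNamespace()` check is an addition for illustration and is not part of the slides; it simply skips the download when the package is already installed.

```r
# Install only if the package is missing (this happens once per computer).
if (!requireNamespace("dplyr", quietly = TRUE)) {
  install.packages("dplyr")
}

# Load the package at the start of every document that uses it.
library(dplyr)
```

Keeping the `library()` calls in the first chunk of the document makes it easy for a reader to see which packages the analysis depends on.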

Basic Tools in RStudio

  • Basic Maths in code chunks:
answer <- 2 + 2
answer
## [1] 4
inputs <- seq(0,1,0.2)

answer <- inputs*10
answer
## [1]  0  2  4  6  8 10

Creating your Document Output

  • Once we have all the text and chunk outputs ready, Knit!
    • Knit to PDF
    • Knit to HTML
  • Set the title, date, author etc. in the YAML header at the top of the R Markdown file

  • Remember, we can produce Documents or Presentations
    • We choose the format when we first create a new R Markdown file

Workflow

  1. Dataframe - A table from a file - read_csv
  2. Process data
    • Cleaning - filter, select, mutate
    • Combine datasets - left_join
    • Create measures/statistics - mutate, summarize
  3. Run regression - zelig
  4. Create results
    • Table - kable, stargazer
    • Graph - ggplot
    • Map - leaflet, mapview
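
The workflow steps above can be sketched end to end on a tiny dataset. This is an illustration under stated assumptions, not the course's own script: the temporary CSV file, the `carrier_names` lookup table and the object names are all hypothetical, and the regression step is left out to keep the sketch short.

```r
library(readr)
library(dplyr)
library(knitr)

# Hypothetical input: write five example flights to a temporary CSV file
# so the sketch runs without any external data.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "carrier,origin,air_time,distance,dep_delay",
  "UA,EWR,227,1400,2",
  "UA,LGA,227,1416,4",
  "AA,JFK,160,1089,2",
  "B6,JFK,183,1576,-1",
  "DL,LGA,116,762,-6"
), csv_path)

# Hypothetical lookup table, used only to illustrate left_join().
carrier_names <- tibble(
  carrier = c("UA", "AA", "B6", "DL"),
  name    = c("United", "American", "JetBlue", "Delta")
)

# 1. Dataframe: read the table from the file.
flights_sample <- read_csv(csv_path, show_col_types = FALSE)

# 2. Process: clean, combine and summarize.
delay_by_carrier <- flights_sample %>%
  filter(!is.na(dep_delay)) %>%                  # cleaning
  left_join(carrier_names, by = "carrier") %>%   # combine datasets
  group_by(name) %>%
  summarize(avg_delay = mean(dep_delay))         # create statistics

# 4. Results: a formatted table (step 3, the regression, is skipped here).
kable(delay_by_carrier)
```

Each numbered comment maps to one step of the workflow, so a reader can follow the data from file to table in a single chunk.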

Load a Dataframe

  • Specific packages let us access APIs for online data, eg.
  • Or from local files using read_csv
data <- read_csv("data.csv")
  • To open SPSS or Stata files:
library(foreign)
# to.data.frame=TRUE returns a data frame rather than a list
data <- read.spss("data.sav", to.data.frame=TRUE)
data <- read.dta("data.dta")

Dataframes

  • Variable Names
  • Observations
  • Values and variable types
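
A short sketch of how these three parts of a dataframe can be inspected. The `flights_sample` table is a hypothetical five-row sample built by hand for illustration, with the same five columns used in the later examples.

```r
library(dplyr)

# Hypothetical five-row sample, typed in by hand.
flights_sample <- tibble(
  carrier   = c("UA", "UA", "AA", "B6", "DL"),
  origin    = c("EWR", "LGA", "JFK", "JFK", "LGA"),
  air_time  = c(227, 227, 160, 183, 116),
  distance  = c(1400, 1416, 1089, 1576, 762),
  dep_delay = c(2, 4, 2, -1, -6)
)

names(flights_sample)           # variable names (columns)
nrow(flights_sample)            # number of observations (rows)
sapply(flights_sample, class)   # variable type of each column
flights_sample$origin           # values of a single variable
```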

Workflow Example

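# Table: average departure delay by carrier
flights %>%
  group_by(carrier) %>%
  summarize(dep_delay=mean(dep_delay,na.rm=TRUE)) %>%
  kable()

# Chart: the same summary plotted by carrier
flights %>%
  group_by(carrier) %>%
  summarize(dep_delay=mean(dep_delay,na.rm=TRUE)) %>%
  ggplot() + geom_point(aes(x=carrier,y=dep_delay))

# Regression: the same summary passed to a model and a results table
flights %>%
  group_by(carrier) %>%
  summarize(dep_delay=mean(dep_delay,na.rm=TRUE)) %>%
  zelig(dep_delay ~ carrier,data=.,model="ls") %>%
  stargazer(digits=3)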

Data Processing: Example Actions on our Dataframe

  • select specific variables (columns)
  • slice observations (rows)
  • filter observations (rows) by conditions (based on values in columns)
  • count number of observations
  • rename variables (columns)
  • arrange table in the order of a particular variable
  • mutate (change) the values of an existing variable or create a new one
  • summarize data by creating statistics
  • round values to a specific number of decimal places
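
A brief sketch of three actions from this list: count, rename and arrange. It assumes a hypothetical five-row `flights_sample` table built by hand, and the new column name `departure_delay` is chosen only for illustration.

```r
library(dplyr)

# Hypothetical five-row sample, typed in by hand.
flights_sample <- tibble(
  carrier   = c("UA", "UA", "AA", "B6", "DL"),
  origin    = c("EWR", "LGA", "JFK", "JFK", "LGA"),
  air_time  = c(227, 227, 160, 183, 116),
  distance  = c(1400, 1416, 1089, 1576, 762),
  dep_delay = c(2, 4, 2, -1, -6)
)

# count: number of observations for each origin airport
origin_counts <- flights_sample %>% count(origin)

# rename: the pattern is new_name = old_name
renamed <- flights_sample %>% rename(departure_delay = dep_delay)

# arrange: sort the rows, here from the longest to the shortest delay
sorted <- flights_sample %>% arrange(desc(dep_delay))
```

None of these actions changes `flights_sample` itself; each returns a new table, which is why the results are assigned to new objects.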

Data Processing: Example Actions on our Dataframe

flights %>%
  select(air_time)
## # A tibble: 5 x 1
##   air_time
##      <dbl>
## 1      227
## 2      227
## 3      160
## 4      183
## 5      116

Data Processing: Example Actions on our Dataframe

flights %>% 
  slice(1:2)
## # A tibble: 2 x 5
##   carrier origin air_time distance dep_delay
##     <chr>  <chr>    <dbl>    <dbl>     <dbl>
## 1      UA    EWR      227     1400         2
## 2      UA    LGA      227     1416         4

Data Processing: Example Actions on our Dataframe

flights %>% 
  filter(origin=="JFK")
## # A tibble: 2 x 5
##   carrier origin air_time distance dep_delay
##     <chr>  <chr>    <dbl>    <dbl>     <dbl>
## 1      AA    JFK      160     1089         2
## 2      B6    JFK      183     1576        -1

Data Processing: Example Actions on our Dataframe

flights %>% 
  mutate(air_time=round(air_time/60,3))
## # A tibble: 5 x 5
##   carrier origin air_time distance dep_delay
##     <chr>  <chr>    <dbl>    <dbl>     <dbl>
## 1      UA    EWR    3.783     1400         2
## 2      UA    LGA    3.783     1416         4
## 3      AA    JFK    2.667     1089         2
## 4      B6    JFK    3.050     1576        -1
## 5      DL    LGA    1.933      762        -6

Data Processing: Example Actions on our Dataframe

flights %>% 
  mutate(speed=round(distance/air_time,3))
## # A tibble: 5 x 6
##   carrier origin air_time distance dep_delay speed
##     <chr>  <chr>    <dbl>    <dbl>     <dbl> <dbl>
## 1      UA    EWR      227     1400         2 6.167
## 2      UA    LGA      227     1416         4 6.238
## 3      AA    JFK      160     1089         2 6.806
## 4      B6    JFK      183     1576        -1 8.612
## 5      DL    LGA      116      762        -6 6.569

Data Processing: Example Actions on our Dataframe

flights %>% 
  summarize(avg_distance=mean(distance,na.rm=TRUE))
## # A tibble: 1 x 1
##   avg_distance
##          <dbl>
## 1       1248.6

Data Processing Example

These actions can be 'piped' together:
We want to find the average speed of United (UA) flights.

In steps: Take the data, filter the data to carrier UA,
calculate the speed of each flight,
and then find the average.

flights %>% filter(carrier=="UA") %>%
  mutate(speed=distance/(air_time/60)) %>%
  summarize(avg_speed=mean(speed,na.rm=TRUE)) %>%
  round(1)

flights %>% filter(carrier=="UA") %>% 
  mutate(speed=distance/(air_time/60)) %>% 
  summarize(avg_speed=mean(speed,na.rm=TRUE)) %>%
  round(1) %>% as.numeric()
## [1] 420.9

Data Processing Example + In-line

These actions can be 'piped' together:
We want to find the average speed of United (UA) flights.

In steps: Take the data, filter the data to carrier UA,
calculate the speed of each flight,
and then find the average.

avg_speed <- flights %>% filter(carrier=="UA") %>% 
  mutate(speed=distance/(air_time/60)) %>% 
  summarize(avg_speed=mean(speed,na.rm=TRUE)) %>%
  round(1)

The average speed of United Flights is `r avg_speed` miles per hour.

The average speed of United Flights is 420.9 miles per hour.

Data Processing Example + In-line

Can we change the order of data processing?

In steps: Take the data, calculate the speed of each flight,
filter the data to carrier UA,
and then find the average.

flights %>%
  mutate(speed=distance/(air_time/60)) %>%
  filter(carrier=="UA") %>%
  summarize(avg_speed=mean(speed,na.rm=TRUE)) %>%
  round(1)

420.9

Data Processing Example + In-line

Can we change the order of data processing?

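In steps: Take the data, calculate the speed of each flight,
find the average,
and then filter the data to carrier UA.

flights %>%
  mutate(speed=distance/(air_time/60)) %>%
  summarize(avg_speed=mean(speed,na.rm=TRUE)) %>%
  filter(carrier=="UA") %>%
  round(1)

394.3

Here the order matters. summarize() collapses the table to a single row and drops the carrier column, so the filter step has no carrier column left to test. The 394.3 appears to be the average over all flights, not only United flights.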
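
One way to summarize first and filter afterwards is to group by carrier before summarizing, so that the carrier column survives into the summary table. The sketch below assumes a hypothetical five-row `flights_sample` table built by hand, so its result differs from the figures computed on the full flights data.

```r
library(dplyr)

# Hypothetical five-row sample, typed in by hand.
flights_sample <- tibble(
  carrier   = c("UA", "UA", "AA", "B6", "DL"),
  origin    = c("EWR", "LGA", "JFK", "JFK", "LGA"),
  air_time  = c(227, 227, 160, 183, 116),
  distance  = c(1400, 1416, 1089, 1576, 762),
  dep_delay = c(2, 4, 2, -1, -6)
)

avg_speed_ua <- flights_sample %>%
  mutate(speed = distance / (air_time / 60)) %>%         # miles per hour
  group_by(carrier) %>%                                  # one summary row per carrier
  summarize(avg_speed = mean(speed, na.rm = TRUE)) %>%
  filter(carrier == "UA") %>%                            # carrier still exists here
  pull(avg_speed) %>%                                    # extract a plain number
  round(1)

avg_speed_ua
```

Because `group_by()` keeps one row per carrier, the later `filter()` has a carrier column to work with, and `pull()` returns a single number that can be used in in-line text.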

Table Outputs

flights %>% slice(1:5) %>% 
  select(carrier,origin,air_time,distance,dep_delay) %>%
  kable()
carrier  origin  air_time  distance  dep_delay
UA       EWR          227      1400          2
UA       LGA          227      1416          4
AA       JFK          160      1089          2
B6       JFK          183      1576         -1
DL       LGA          116       762         -6

Table Outputs

flights %>% slice(1:5) %>% 
  select(carrier,origin,air_time,distance,dep_delay) %>%
  kable(caption="Example Table", align="lcccc")
Example Table
carrier  origin  air_time  distance  dep_delay
UA        EWR      227       1400        2
UA        LGA      227       1416        4
AA        JFK      160       1089        2
B6        JFK      183       1576       -1
DL        LGA      116        762       -6
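
Beyond the caption and alignment, kable() can also relabel columns and round numbers. The sketch below assumes a hypothetical five-row `flights_sample` table built by hand; the column labels and caption are invented for illustration, and `format = "pipe"` is set only so the table prints as plain text outside a knitted document.

```r
library(dplyr)
library(knitr)

# Hypothetical five-row sample, typed in by hand.
flights_sample <- tibble(
  carrier   = c("UA", "UA", "AA", "B6", "DL"),
  origin    = c("EWR", "LGA", "JFK", "JFK", "LGA"),
  air_time  = c(227, 227, 160, 183, 116),
  distance  = c(1400, 1416, 1089, 1576, 762),
  dep_delay = c(2, 4, 2, -1, -6)
)

speed_table <- flights_sample %>%
  mutate(speed = distance / (air_time / 60)) %>%   # miles per hour
  select(carrier, origin, speed) %>%
  kable(
    col.names = c("Carrier", "Origin", "Speed (mph)"),  # reader-friendly headers
    digits    = 1,                                      # round numeric columns
    caption   = "Speed of each flight",
    format    = "pipe"
  )

speed_table
```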

Chart Outputs

flights %>%
  filter(carrier=="UA") %>%
  ggplot() + geom_point(aes(x=dep_time,y=dep_delay))

Chart Outputs

flights %>%
  filter(carrier=="UA") %>%
  ggplot() + geom_point(aes(x=dep_time,y=dep_delay)) +
  geom_smooth(aes(x=dep_time,y=dep_delay))

Chart Outputs

flights %>%
  filter(carrier=="UA") %>%
  ggplot() + geom_point(aes(x=dep_time,y=dep_delay)) +
  geom_smooth(aes(x=dep_time,y=dep_delay)) +
  ggtitle("Example Chart") +
  xlab("Departure Time") +
  ylab("Departure Delay")

Chart Outputs

flights %>%
  ggplot() + geom_bar(aes(x=dep_delay))

Chart Outputs

flights %>%
  ggplot() + geom_bar(aes(x=dep_delay)) +
  xlim(-30,100)

Chart Outputs

flights %>%
  group_by(origin) %>%
  summarize(avg_delay=mean(dep_delay,na.rm=TRUE)) %>%
  ggplot() + geom_col(aes(x=origin, y=avg_delay))
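
A chart can also be stored as an object, labelled, and saved to a file for use outside the knitted document. This is a sketch under stated assumptions: the five-row `flights_sample` table is hypothetical, the labels are invented, and the output path is a temporary file chosen for illustration.

```r
library(dplyr)
library(ggplot2)

# Hypothetical five-row sample, typed in by hand.
flights_sample <- tibble(
  carrier   = c("UA", "UA", "AA", "B6", "DL"),
  origin    = c("EWR", "LGA", "JFK", "JFK", "LGA"),
  air_time  = c(227, 227, 160, 183, 116),
  distance  = c(1400, 1416, 1089, 1576, 762),
  dep_delay = c(2, 4, 2, -1, -6)
)

# Build the chart and keep it as an object instead of printing it directly.
delay_chart <- flights_sample %>%
  group_by(origin) %>%
  summarize(avg_delay = mean(dep_delay, na.rm = TRUE)) %>%
  ggplot() +
  geom_col(aes(x = origin, y = avg_delay)) +
  labs(
    title = "Average Departure Delay by Origin",
    x     = "Origin Airport",
    y     = "Average Delay (minutes)"
  )

# Hypothetical output path; in a project this would be a file in the project folder.
chart_path <- file.path(tempdir(), "delay_chart.png")
ggsave(chart_path, plot = delay_chart, width = 6, height = 4)
```

Saving with `ggsave()` is optional when knitting, since the chart appears in the output document automatically, but it is useful when the same figure is needed elsewhere.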